DSCI 573 - Feature and Model Selection

Lab 4: A mini project: Putting it all together

Table of contents

  1. Submission instructions
  2. Understanding the problem
  3. Data splitting
  4. EDA
  5. (optional) Feature engineering
  6. Preprocessing and transformations
  7. Baseline model
  8. Linear models
  9. Different classifiers
  10. (optional) Feature selection
  11. Hyperparameter optimization
  12. Interpretation and feature importances
  13. Results on the test set
  14. Summary of the results
  15. (optional) Your takeaway from the course

Submission instructions


rubric={mechanics:2}

You will receive marks for correctly submitting this assignment.

To correctly submit this assignment follow the instructions below:

Here you will find the description of each rubric used in MDS.

NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a .gitignore, which should prevent you from pushing CSVs.

Introduction

In this lab you will be working on an open-ended mini-project, where you will put all the different things you have learned so far in 571 and 573 together to solve an interesting problem.

A few notes and tips when you work on this mini-project:

Tips

  1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary.
  2. Do not include everything you ever tried in your submission -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code.
  3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is the key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution.

Assessment

We plan to grade fairly and leniently. We don't have some secret target score that you need to achieve to get a good grade. You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results. For example, if you just have a bunch of code and no text or figures, that's not good. If you do a bunch of sane things and get a lower accuracy than your friend, don't sweat it.

A final note

Finally, the style of this "project" question is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (2-8 hours???) is a good guideline for a typical submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or getting the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found this kind of lab useful and fun, and I hope you enjoy it as well.

1. Understanding the problem


rubric={accuracy:1,reasoning:2}

In this mini project, you will be working on a classification problem of predicting whether a credit card client will default or not. For this problem, you will use Default of Credit Card Clients Dataset. In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with the associated research paper, which is available through the UBC library.

Your tasks:

  1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on the dataset page on Kaggle. Write a few sentences on your initial thoughts on the problem and the dataset.
  2. Download the dataset and read it as a pandas dataframe.
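Reading the downloaded CSV into pandas can be sketched as follows. The file name and path are assumptions (they depend on where you save the Kaggle download); a tiny inline sample keeps the sketch self-contained:

```python
import io

import pandas as pd

# In the lab, the real file would be read with something like
# pd.read_csv("data/UCI_Credit_Card.csv") -- the path is an assumption.
# A two-row inline sample demonstrates the same pattern:
csv_text = """ID,LIMIT_BAL,SEX,EDUCATION,AGE,default.payment.next.month
1,20000,2,2,24,1
2,120000,2,2,26,1
"""
df = pd.read_csv(io.StringIO(csv_text))
```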

Answer 1.1

This is a classification problem. The target column is default.payment.next.month; "1" stands for yes, which is the class we are more interested in, so 'recall' or 'f1' may be an appropriate scoring metric. There are 24 features, all of numeric data type. Some of the columns are binary, such as sex, and some are categorical features that are already ordinally encoded, such as education, so they can be passed through without any transformation. With 30,000 examples, the dataset is large enough that optimization bias on the validation set is not a major concern.

The dataset is from Taiwan so any model developed based on the data may not be appropriate for use in other countries. In addition, the data was collected back in 2005 so it could be outdated. The model should be used with caution.

2. Data splitting


rubric={reasoning:2}

Your tasks:

  1. Split the data into train and test portions.
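A minimal sketch of the split, using a small stand-in frame so it runs on its own (in the lab this would be the full credit card dataframe; the column names are taken from the data description):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the real dataframe read from the Kaggle CSV.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 200000, 60000, 80000, 140000],
    "AGE": [24, 26, 34, 37, 57, 29, 23, 28],
    "default.payment.next.month": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Stratifying on the target preserves the class proportions (important here,
# since only ~22% of clients default) in both portions.
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=123,
    stratify=df["default.payment.next.month"],
)
```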

3. EDA


rubric={viz:4,reasoning:4}

Your tasks:

  1. Perform exploratory data analysis on the train set.
  2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
  3. Summarize your initial observations about the data.
  4. Pick appropriate metric/metrics for assessment.

Observation 1.

From the above information, we can see that there are no missing values in the dataset.

Observation 2.

We have class imbalance: class "1" makes up only 22.21% of train_df. Since class "1" is what we are interested in classifying, I would go with the f1 score.
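The class balance check behind this observation can be sketched as follows (the target series here is a synthetic stand-in with the same rough proportions; in the lab it would be train_df["default.payment.next.month"]):

```python
import pandas as pd

# Stand-in target column with roughly the class balance reported above.
y_train = pd.Series([0] * 78 + [1] * 22)

# normalize=True returns proportions rather than raw counts.
class_proportions = y_train.value_counts(normalize=True)
```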

Observation 3.

It does not seem that there is a correlation between age and credit limit.

Observation 4.

Age does not seem to be a useful predictor of default payments because both classes present similar distributions across all age groups.

Observation 5.

Credit limit also does not seem to be a particularly useful predictor of default payments because both classes present similar distributions across all credit limit groups.

Summary of Initial Observations.

  1. There are no missing values, so a SimpleImputer is not needed in the data transformation.
  2. There is class imbalance, and since we are more interested in one of the classes, I will use the f1 score as the scoring metric.

(optional) 4. Feature engineering


rubric={reasoning:1}

Your tasks:

  1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing.

5. Preprocessing and transformations


rubric={accuracy:4,reasoning:4}

Your tasks:

  1. Identify different feature types and the transformations you would apply on each feature type.
  2. Define a column transformer, if necessary.
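A sketch of a possible column transformer for this data. The feature groupings below are assumptions based on the data description (scale the continuous amounts, one-hot encode MARRIAGE, pass through columns that are already encoded); a tiny stand-in frame shows the transformed width:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_feats = ["LIMIT_BAL", "AGE", "BILL_AMT1", "PAY_AMT1"]  # scale
categorical_feats = ["MARRIAGE"]                               # one-hot encode
passthrough_feats = ["SEX", "PAY_0"]                           # already encoded

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_feats),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_feats),
        ("pass", "passthrough", passthrough_feats),
    ]
)

# Output width: 4 scaled + 2 one-hot (two distinct MARRIAGE values
# in this demo frame) + 2 passthrough = 8 columns.
X_demo = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000], "AGE": [24, 26],
    "BILL_AMT1": [3913, 2682], "PAY_AMT1": [0, 1000],
    "MARRIAGE": [1, 2], "SEX": [2, 2], "PAY_0": [2, -1],
})
X_transformed = preprocessor.fit_transform(X_demo)
```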

6. Baseline model


rubric={accuracy:2}

Your tasks:

  1. Try scikit-learn's baseline model and report results.
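A minimal baseline sketch on synthetic stand-in data (the real lab would pass the preprocessed training data). With strategy="most_frequent" the dummy never predicts class 1, so its f1 score is 0; other strategies, such as "stratified", give a small nonzero f1:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in with roughly the lab's class balance (assumption).
rng = np.random.default_rng(123)
X_train = rng.normal(size=(100, 4))
y_train = np.array([0] * 78 + [1] * 22)

dummy = DummyClassifier(strategy="most_frequent")
scores = cross_validate(dummy, X_train, y_train,
                        scoring="f1", return_train_score=True)
```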

7. Linear models


rubric={accuracy:4,reasoning:2}

Your tasks:

  1. Try logistic regression as a first real attempt.
  2. Carry out hyperparameter tuning to explore different values for the regularization hyperparameter.
  3. Report validation scores along with standard deviation.
  4. Summarize your results.

Answer 7.4

LogisticRegression with an optimized C returned a better f1 score than the baseline DummyClassifier on both the train and validation sets.
With randomized hyperparameter search, the regularization hyperparameter C was determined to be 0.01.
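The search over C can be sketched as follows, on synthetic stand-in data (in the lab, the pipeline would start with the column transformer from the preprocessing step, and the C range is an assumption):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, weights=[0.78], random_state=123)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Sample C log-uniformly over several orders of magnitude.
param_dist = {"logisticregression__C": loguniform(1e-3, 1e3)}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, scoring="f1",
                            random_state=123, n_jobs=-1)
search.fit(X, y)
best_C = search.best_params_["logisticregression__C"]
```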

8. Different classifiers


rubric={accuracy:4,quality:2,reasoning:4}

Your tasks:

  1. Try at least 3 other models aside from logistic regression.
  2. Summarize your results. Can you beat logistic regression?

8.2 Answer

RandomForestClassifier, XGBoost, and LightGBM were tried in addition to LogisticRegression. None of them beat LogisticRegression's validation score.
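The model comparison loop can be sketched as below. To keep the sketch self-contained it uses synthetic data and scikit-learn models only; XGBClassifier and LGBMClassifier would slot into the same dictionary if those packages are installed:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Imbalanced synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, weights=[0.78], random_state=123)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=123),
}
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, scoring="f1", return_train_score=True)
    results[name] = {
        "mean_val_f1": cv["test_score"].mean(),
        "mean_train_f1": cv["train_score"].mean(),
    }
results_df = pd.DataFrame(results).T  # one row per model
```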

(optional) 9. Feature selection


rubric={reasoning:1}

Your tasks:

Make some attempts to select relevant features. You may try RFECV, forward selection or L1 regularization for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises.

Answer 9.

The validation f1 score did not improve with L1 regularization, but the gap between training and validation scores is now smaller. I will abandon L1-based feature selection in the next exercises.
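L1-based selection can be sketched as follows on synthetic data: with an L1 penalty, many coefficients are driven exactly to zero, and the features with non-zero coefficients are the ones "selected" (the C value here is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data where only a few features are actually informative.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=123)

# The L1 penalty requires a compatible solver, e.g. "liblinear" or "saga".
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lr_l1.fit(X, y)

# Count how many features survive the L1 penalty.
n_selected = int((lr_l1.coef_[0] != 0).sum())
```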

10. Hyperparameter optimization


rubric={accuracy:4,quality:2,reasoning:4}

Your tasks:

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use sklearn's methods for hyperparameter optimization or fancier Bayesian optimization methods.

Answer 10. Summary of Hyperparameter Optimization Results.

Hyperparameter optimization was carried out on RandomForestClassifier, XGBoost, and LightGBM.
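Optimizing multiple hyperparameters jointly for one model can be sketched as below for the random forest (the same pattern applies to XGBoost and LightGBM; the search distributions are assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Imbalanced synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, weights=[0.78], random_state=123)

# Three hyperparameters searched jointly in one run.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 15),
    "max_features": ["sqrt", "log2", None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=123),
    param_dist, n_iter=10, scoring="f1", random_state=123, n_jobs=-1,
)
search.fit(X, y)
```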

11. Interpretation and feature importances


rubric={accuracy:3,viz:3,reasoning:4}

Your tasks:

  1. Use the methods we saw in class (e.g., eli5, shap) (or any other methods of your choice) to explain one of the best performing models. Summarize your observations.

Summary of Observation:

Force Plot

Logistic Regression is a linear model; its expected (base) value is -0.22, as shown above. For each example, a score is computed and compared to this base value. I used the 11th example to produce the force plot above. We can see that its score is 0.26; since this is higher than the base value, the prediction is class 1 (will default next month).

The forces that drive the prediction towards class 1 include PAY_5, PAY_4, PAY_6, and PAY_0. The forces that drive the prediction towards class 0 include LIMIT_BAL and PAY_6.
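For a linear model, the arithmetic behind the force plot can be checked by hand: the SHAP value of feature j for example i is coef_j * (x_ij - mean_j), the base value is the model's mean raw (log-odds) score, and base value plus the sum of SHAP values recovers the example's score. This is what shap's LinearExplainer computes under the feature-independence assumption. A minimal sketch on synthetic data (the dataset and example index are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=123)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Base value: the average raw decision score over the data.
base_value = lr.decision_function(X).mean()

# Linear SHAP values: coefficient times deviation from the feature mean.
shap_values = lr.coef_[0] * (X - X.mean(axis=0))

i = 10  # an arbitrary example, like the "11th example" above
score = base_value + shap_values[i].sum()
# Positive contributions push the score above the base value (towards
# class 1); negative contributions push it below (towards class 0).
```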

Summary Plot

The plot shows the most important features for predicting the classes. In this plot, we can see that:

Dependence Plot

Coefficient Table

12. Results on the test set


rubric={accuracy:2,viz:2,reasoning:4}

Your tasks:

  1. Try your best performing model on the test data and report test scores.
  2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias?
  3. Take one or two test predictions and explain them with SHAP force plots.

Answers:

12.2 - The test f1 score is 0.528, which is comparable to the validation score of 0.534. Since the train and test sets are large, there is little chance of optimization bias.
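Scoring the final model once on the held-out test set can be sketched as follows (synthetic stand-in data; in the lab, the model would be the tuned pipeline and the split would be the one made in section 2):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.78], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Fit the chosen model on the full training portion, then score exactly
# once on the untouched test portion.
final_model = LogisticRegression(C=0.01, max_iter=1000).fit(X_train, y_train)
test_f1 = f1_score(y_test, final_model.predict(X_test))
```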

Test example explanation with Force Plot

As above, I used the 1st test example to produce a force plot. We can see that the score is -0.43; since this is lower than the base value of -0.22, the prediction is class 0 (will not default next month). This is confirmed by the model's prediction above.

13. Summary of results


rubric={reasoning:6}

Your tasks:

  1. Create a table summarizing important results.
  2. Write concluding remarks.
  3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability.
  4. Report your final test score along with the metric you used at the top of this notebook in the Submission instructions section.
| Description | Validation F1 Score | Train F1 Score | Observation |
| --- | --- | --- | --- |
| DummyClassifier | 0.229 | 0.219 | Baseline model |
| Logistic Regression with L1 regularization, C=0.01 | 0.534 | 0.537 | The best performing model |
| RandomForest | 0.476 | 0.999 | Overfits the training set |
| XGBoost | 0.458 | 0.706 | Poor on both train and validation sets |
| LightGBM | 0.475 | 0.568 | Poor on both train and validation sets |
| RandomForest - hyperparameter optimized | 0.466 | 0.706 | Validation score improved slightly |
| XGBoost - hyperparameter optimized | 0.471 | 0.561 | Poor on both train and validation sets |
| LightGBM - hyperparameter optimized | 0.478 | 0.565 | Poor on both train and validation sets |

Remarks:

The dataset is mostly numeric. Since there is class imbalance, I chose the f1 score as the scoring metric.
Logistic regression with L1 regularization and C=0.01 is the best performing model, as shown above, with a validation f1 score of 0.534. This is significantly higher than the baseline DummyClassifier. Most of the other models have validation f1 scores around 0.47, so logistic regression with L1 regularization outperforms them.

The test f1 score is 0.528. It is comparable to the validation score and because the dataset is large enough, there is less chance for optimization bias.

The coefficient table shows the most important features according to the best model. It seems that the repayment status features are more important than the others.

From the SHAP value plots, we can see that there is an inverse relation between credit limit balance and SHAP value. This also means that the higher the credit limit, the less likely a customer is to default on payment in the next month.

Other ideas we could try:

There are 24 features and since all of them are numeric, we could try a polynomial feature transformation to make non-linearly separable data linearly separable. We could also try RFE with L2 regularization to select the most important features and reduce overfitting.

(optional) 14. Your takeaway from the course


rubric={reasoning:1}

Your tasks:

What is your biggest takeaway from this course?

Note that there is no correct answer here but we would appreciate thoughtful answers.

My biggest takeaway from this course is that when I work on machine learning in my job, I will have to do a lot of research on my own, such as reading source code, documentation, and questions on StackOverflow.

Submission to Canvas

PLEASE READ: When you are ready to submit your assignment do the following:

Well done!! Congratulations on finishing all labs for this course!!